Data (and Metadata) Management Strategies

Kim Cressman and Gabriel Kamener

Catbird Stats, LLC (KC); Florida International University (GK)

2023-11-16

Goals for this presentation

  • NOT to shame you
  • Give you some knowledge to build better datasets
    • some = a manageable amount
    • …moving forward
  • Point you toward helpful resources

Spreadsheets are useful

Even I will not fight that fact

There are multiple purposes for keeping tabular data in spreadsheets:

  • Data entry
  • Data storage
  • Data analysis
  • Presentation

Spreadsheets can be dangerous

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes

Repetition mistakes

data in a spreadsheet

Repetition mistakes

data in a spreadsheet, with two rows outlined in orange

Repetition mistakes

data in a spreadsheet, two rows outlined in orange, one of those rows highlighted in yellow. the data in those two rows should be the same because it references the same sampling plot, but the data in the highlighted line was accidentally pasted from the plot above.

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes
  • Excessive formula use can eventually lead to unseen errors

Excel and dates

  • dates “seen” differently in the computer than in our brains
  • dates seen differently based on operating system

Oprah "you get a car" meme but with dates

Best practices

Helpful sources

  • Broman and Woo 2018, Data Organization in Spreadsheets (open access)
  • White et al. 2013, Nine simple ways to make it easier to (re)use your data (pdf)
  • Tampa Bay Estuary Program Data Management SOP
  • Wickham 2014, Tidy Data (open access)

Rectangles

  • One table per sheet
  • When adding data, add rows, not columns
  • One type of data per column
    • don’t type “No Data” in an otherwise numeric column
  • Be thoughtful about column names
    • and don’t use special characters in them
  • Be thoughtful about representation of missing data

Make information explicit

  • QA/QC columns, rather than comments on a cell
  • Additional columns, rather than [only] color coding

Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time


Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time
  • Tables can be related to each other via common columns, known as “keys”
    • can even do this in Excel, with XLOOKUP

Example: fish monitoring data

Data safety

  • Don’t do any calculations in the raw data file!
    • Make a copy.
  • Back up your data!
    • Keep it in 3 places
    • At least one in a different physical location

When to move beyond spreadsheets

Wait, what’s “beyond” a spreadsheet?

Relational database!

“All relational databases organize data into sets of interlinked tables.” -Thomer and Wickett 2020 (open access)

“A database is, in some sense, just a collection of tables, where there’s some value in the tables that allows them to be connected to each other (the ‘related’ part of ‘relational database’).” -Data Carpentry ‘Data Management with SQL for Ecologists’ workshop

Some software examples

  • Access

  • Oracle

  • MySQL

Advantages

(of a well-built relational database)

  • “Front end” / “Back end”

    • data entry is (can be) human-friendly

    • data validation

    • data storage is computer-friendly

    • all the linkages happen without you having to think about them

Advantages

(of a well-built relational database)

  • Queries - you can pull data back out in different ways

    • e.g., if you wanted the lat/long and habitat information associated with each individual sampling event or even individual fish

      • without the errors you’d get from copying and pasting that information into every sample row
    • WITHOUT altering the original data

Barriers

  • Not everybody has ready access to database expertise

  • Not every database is designed well

  • Good databases require thoughtful design, as well as ongoing maintenance

When to think about databases

  • Long-term projects

  • Projects involving lots of complexity

  • Projects where consistency system-wide is important

Documenting your data

Why

not just for others, but for you and future you and future colleagues down the road

Metadata - what

At its most basic: who, what, why, where, how

Data Dictionaries

add details

JUST DO SOMETHING

Different ways - SWMP metadata, EML
Important thing is to capture the information - reformatting down the road is much easier than if you’d never written it down in the first place